layout: true <div class="my-footer" style="width: auto"> <img src="http://agbu.une.edu.au/img/AGBU_logo_2017.jpg" /> </div> --- class: middle, center ## tidyverse and data.table ### Phillip Gurman ### 3/07/2019 --- # Tidyverse <img src="https://www.tidyverse.org/images/hex-tidyverse.png" style = "position: fixed; right: 10px; height: 118px"/> https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/ * "The tidyverse is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy." * https://www.rstudio.com/resources/cheatsheets/ --- background-image: url(http://www.storybench.org/wp-content/uploads/2017/05/tidyverse.png) background-position: 50% 50% background-size: 100% class: center, bottom --- # R Package Downloads <!-- --> --- # data.table <img src="https://github.com/Rdatatable/data.table/wiki/icons/sticker.png" style = "position: fixed; right: 10px; height: 118px"/> https://github.com/Rdatatable/data.table/wiki * Designed for fast operations on in memory data. * Notationally more compact and less english like. * More similiar to base R notation. * Quicker than base R and dplyr, with many operations implemented in parallel. * Dates back to 2006. https://h2oai.github.io/db-benchmark/index/plots/groupby.1e8_1e2_0_0.png --- # dplyr <img src="https://d33wubrfki0l68.cloudfront.net/071952491ec4a6a532a3f70ecfa2507af4d341f9/c167c/images/hex-dplyr.png" style = "position: fixed; right: 10px; height: 118px"/> * R package to perform manipulation of data stored in a `data.frame()` ## Basic dplyr verbs * `select()`: focus on a subset of variables * `filter()`: focus on a subset of rows * `mutate()`: add new columns * `summarise()` and `group_by()`: reduce each group to a smaller number of summary statistics * `arrange()`: sorts rows Simple notation, which can be used for databases --- # Example dataset diamonds ```r str(diamonds) ``` ``` ## 'data.frame': 53940 obs. of 10 variables: ## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ... ## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ... ## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ... ## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ... ## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ... ## $ table : num 55 61 65 58 58 57 57 55 61 61 ... ## $ price : int 326 326 327 334 335 336 336 337 337 338 ... ## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ... ## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ... ## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ... ``` --- # Tibble <img src="https://www.jumpingrivers.com/wp-content/uploads/2018/06/tibbles.png" style = "position: fixed; right: 10px; height: 118px"/> The tidyverse has a package called `tibble`, which extends the standard `data.frame()`. * Formats the standard data print message, shows the first 10 rows with information about data types. * Prevents killing R by printing too larger object. * Tibbles will not guess what you mean when use type in a partial name. ```r (diamonds <- as_tibble(diamonds)) ``` ``` ## # A tibble: 53,940 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 ## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 ## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 ## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 ## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 ## # ... with 53,930 more rows ``` --- # Data Table Display * Also shows a formatted data output of the first and last 5 rows. * Internally can also stores indexes for columns. ```r (diamonds_dt <- as.data.table(diamonds)) ``` ``` ## carat cut color clarity depth table price x y z ## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63 ## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ## --- ## 53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50 ## 53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61 ## 53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56 ## 53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74 ## 53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64 ``` --- # Take a subset of columns Take only the `carat` `cut` and `clarity` columns. .pull-left[ ### dplyr ```r select(diamonds,carat, cut, clarity) ``` ``` ## # A tibble: 53,940 x 3 ## carat cut clarity ## <dbl> <ord> <ord> ## 1 0.23 Ideal SI2 ## 2 0.21 Premium SI1 ## 3 0.23 Good VS1 ## 4 0.290 Premium VS2 ## 5 0.31 Good SI2 ## 6 0.24 Very Good VVS2 ## 7 0.24 Very Good VVS1 ## 8 0.26 Very Good SI1 ## 9 0.22 Fair VS2 ## 10 0.23 Very Good VS1 ## # ... with 53,930 more rows ``` ] .pull-right[ ### data.table ```r diamonds_dt[, .(carat, cut, clarity)] ``` ``` ## carat cut clarity ## 1: 0.23 Ideal SI2 ## 2: 0.21 Premium SI1 ## 3: 0.23 Good VS1 ## 4: 0.29 Premium VS2 ## 5: 0.31 Good SI2 ## --- ## 53936: 0.72 Ideal SI1 ## 53937: 0.72 Good SI1 ## 53938: 0.70 Very Good SI1 ## 53939: 0.86 Premium SI2 ## 53940: 0.75 Ideal SI2 ``` ] ## --- # Modify a columns Conver the x column to inches (why? Maybe your American) .pull-left[ .small[ ### dplyr `mutate()` * Can be performed in place `x = x` or into new column. * Can used just calculated column in next calculation. Convert the length column `x` from mm to inches and then create `a` as half `x`. ```r mutate(diamonds, x = x / 25.4, a = x / 2) ``` ``` ## # A tibble: 53,940 x 11 ## carat cut color clarity depth table price x y z a ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 0.156 3.98 2.43 0.0778 ## 2 0.21 Premium E SI1 59.8 61 326 0.153 3.84 2.31 0.0766 ## 3 0.23 Good E VS1 56.9 65 327 0.159 4.07 2.31 0.0797 ## 4 0.290 Premium I VS2 62.4 58 334 0.165 4.23 2.63 0.0827 ## 5 0.31 Good J SI2 63.3 58 335 0.171 4.35 2.75 0.0854 ## 6 0.24 Very Good J VVS2 62.8 57 336 0.155 3.96 2.48 0.0776 ## 7 0.24 Very Good I VVS1 62.3 57 336 0.156 3.98 2.47 0.0778 ## 8 0.26 Very Good H SI1 61.9 55 337 0.160 4.11 2.53 0.0801 ## 9 0.22 Fair E VS2 65.1 61 337 0.152 3.78 2.49 0.0762 ## 10 0.23 Very Good H VS1 59.4 61 338 0.157 4.05 2.39 0.0787 ## # ... with 53,930 more rows ``` ] ] .pull-right[ .small[ ### data.table NB: Inplace modify ```r temp <- diamonds_dt temp[, x := x / 25.4] temp ``` ``` ## carat cut color clarity depth table price x y z ## 1: 0.23 Ideal E SI2 61.5 55 326 0.1555118 3.98 2.43 ## 2: 0.21 Premium E SI1 59.8 61 326 0.1531496 3.84 2.31 ## 3: 0.23 Good E VS1 56.9 65 327 0.1594488 4.07 2.31 ## 4: 0.29 Premium I VS2 62.4 58 334 0.1653543 4.23 2.63 ## 5: 0.31 Good J SI2 63.3 58 335 0.1708661 4.35 2.75 ## --- ## 53936: 0.72 Ideal D SI1 60.8 57 2757 0.2263780 5.76 3.50 ## 53937: 0.72 Good D SI1 63.1 55 2757 0.2240157 5.75 3.61 ## 53938: 0.70 Very Good D SI1 62.8 60 2757 0.2228346 5.68 3.56 ## 53939: 0.86 Premium H SI2 61.0 58 2757 0.2421260 6.12 3.74 ## 53940: 0.75 Ideal D SI2 62.2 55 2757 0.2295276 5.87 3.64 ``` ] ] --- # Sorting Sort the data by highest price first .pull-left[ .small[ ### dplyr `arrange(df, column)` * Default ascending order * Can be wrapped in desc() to achieve descending order. ```r arrange(diamonds, desc(price)) ``` ``` ## # A tibble: 53,940 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16 ## 2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04 ## 3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56 ## 4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11 ## 5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01 ## 6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24 ## 7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84 ## 8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91 ## 9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7 ## 10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21 ## # ... with 53,930 more rows ``` ]] .pull-right[ .small[ ### data.table ```r diamonds_dt[order(-price)] ``` ``` ## carat cut color clarity depth table price x y z ## 1: 2.29 Premium I VS2 60.8 60 18823 0.3346457 8.47 5.16 ## 2: 2.00 Very Good G SI1 63.5 56 18818 0.3110236 7.97 5.04 ## 3: 1.51 Ideal G IF 61.7 55 18806 0.2901575 7.41 4.56 ## 4: 2.07 Ideal G SI2 62.5 55 18804 0.3228346 8.13 5.11 ## 5: 2.00 Very Good H SI1 62.8 57 18803 0.3129921 8.00 5.01 ## --- ## 53936: 0.31 Good J SI2 63.3 58 335 0.1708661 4.35 2.75 ## 53937: 0.29 Premium I VS2 62.4 58 334 0.1653543 4.23 2.63 ## 53938: 0.23 Good E VS1 56.9 65 327 0.1594488 4.07 2.31 ## 53939: 0.23 Ideal E SI2 61.5 55 326 0.1555118 3.98 2.43 ## 53940: 0.21 Premium E SI1 59.8 61 326 0.1531496 3.84 2.31 ``` ]] --- # Group By * Splits the data into chunks to group together to operate on within groups. * For example, the count, mean, min and max of the prices by cut. .pull-left[ .small[ ### dplyr ```r summarise( group_by(diamonds, cut), n = n(), mean(price), min(price), max(price)) ``` ``` ## # A tibble: 5 x 5 ## cut n `mean(price)` `min(price)` `max(price)` ## <ord> <int> <dbl> <int> <int> ## 1 Fair 1610 4359. 337 18574 ## 2 Good 4906 3929. 327 18788 ## 3 Very Good 12082 3982. 336 18818 ## 4 Premium 13791 4584. 326 18823 ## 5 Ideal 21551 3458. 326 18806 ``` ]] .pull-right[ .small[ ### data.table ```r diamonds_dt[ , .(n = .N, mean = mean(price), min = min(price), max = max(price)), by = cut] ``` ``` ## cut n mean min max ## 1: Ideal 21551 3457.542 326 18806 ## 2: Premium 13791 4584.258 326 18823 ## 3: Good 4906 3928.864 327 18788 ## 4: Very Good 12082 3981.760 336 18818 ## 5: Fair 1610 4358.758 337 18574 ``` ]] --- # pipe command <img src="https://magrittr.tidyverse.org/logo.png" style = "position: fixed; right: 10px; height: 118px"/> ### `%>%` * Reorders code in left to right manner * So that `select(table, columns)` becomes `table %>% select(columns)` *i.e. the object on the left gets passsed as the first argument into the function. * Keyboard shortcut for RStudio Ctrl-Shift-M * Allows multiple commands to be chained together resulting in more readable code left to right or up to down. * Required to achieve more complex data manipulation * Part of the `magrittr` package. * Package also contains some other useful functions for working with dataframes in pipes. * e.g `set_colnames()` or `set_rownames()` * There are other types of pipes also, but with less obvious use cases. --- # pipe example Previous group by using a pipe before we pass the data into a formatter. .small[ ```r diamonds %>% group_by(cut,color) %>% summarise(n = n(), mean = mean(price), min = min(price), max = max(price)) %>% DT::datatable(class = 'compact') ```
] --- # Static ```r diamonds %>% group_by(cut,color) %>% summarise(n = n(), mean = mean(price), min = min(price), max = max(price)) %>% kable() %>% kable_styling() ``` <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> cut </th> <th style="text-align:left;"> color </th> <th style="text-align:right;"> n </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> min </th> <th style="text-align:right;"> max </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 163 </td> <td style="text-align:right;"> 4291.061 </td> <td style="text-align:right;"> 536 </td> <td style="text-align:right;"> 16386 </td> </tr> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> E </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 3682.312 </td> <td style="text-align:right;"> 337 </td> <td style="text-align:right;"> 15584 </td> </tr> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 312 </td> <td style="text-align:right;"> 3827.003 </td> <td style="text-align:right;"> 496 </td> <td style="text-align:right;"> 17995 </td> </tr> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> G </td> <td style="text-align:right;"> 314 </td> <td style="text-align:right;"> 4239.255 </td> <td style="text-align:right;"> 369 </td> <td style="text-align:right;"> 18574 </td> </tr> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> H </td> <td style="text-align:right;"> 303 </td> <td style="text-align:right;"> 5135.683 </td> <td style="text-align:right;"> 659 </td> <td style="text-align:right;"> 18565 </td> </tr> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> I </td> <td style="text-align:right;"> 175 </td> <td style="text-align:right;"> 4685.446 </td> <td style="text-align:right;"> 735 </td> <td style="text-align:right;"> 18242 </td> </tr> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> J </td> <td style="text-align:right;"> 119 </td> <td style="text-align:right;"> 4975.655 </td> <td style="text-align:right;"> 416 </td> <td style="text-align:right;"> 18531 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 662 </td> <td style="text-align:right;"> 3405.382 </td> <td style="text-align:right;"> 361 </td> <td style="text-align:right;"> 18468 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> E </td> <td style="text-align:right;"> 933 </td> <td style="text-align:right;"> 3423.644 </td> <td style="text-align:right;"> 327 </td> <td style="text-align:right;"> 18236 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 909 </td> <td style="text-align:right;"> 3495.750 </td> <td style="text-align:right;"> 357 </td> <td style="text-align:right;"> 18686 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> G </td> <td style="text-align:right;"> 871 </td> <td style="text-align:right;"> 4123.482 </td> <td style="text-align:right;"> 394 </td> <td style="text-align:right;"> 18788 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> H </td> <td style="text-align:right;"> 702 </td> <td style="text-align:right;"> 4276.255 </td> <td style="text-align:right;"> 368 </td> <td style="text-align:right;"> 18640 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> I </td> <td style="text-align:right;"> 522 </td> <td style="text-align:right;"> 5078.533 </td> <td style="text-align:right;"> 351 </td> <td style="text-align:right;"> 18707 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> J </td> <td style="text-align:right;"> 307 </td> <td style="text-align:right;"> 4574.173 </td> <td style="text-align:right;"> 335 </td> <td style="text-align:right;"> 18325 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 1513 </td> <td style="text-align:right;"> 3470.467 </td> <td style="text-align:right;"> 357 </td> <td style="text-align:right;"> 18542 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> E </td> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 3214.652 </td> <td style="text-align:right;"> 352 </td> <td style="text-align:right;"> 18731 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 2164 </td> <td style="text-align:right;"> 3778.820 </td> <td style="text-align:right;"> 357 </td> <td style="text-align:right;"> 18777 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> G </td> <td style="text-align:right;"> 2299 </td> <td style="text-align:right;"> 3872.754 </td> <td style="text-align:right;"> 354 </td> <td style="text-align:right;"> 18818 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> H </td> <td style="text-align:right;"> 1824 </td> <td style="text-align:right;"> 4535.390 </td> <td style="text-align:right;"> 337 </td> <td style="text-align:right;"> 18803 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> I </td> <td style="text-align:right;"> 1204 </td> <td style="text-align:right;"> 5255.880 </td> <td style="text-align:right;"> 336 </td> <td style="text-align:right;"> 18500 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:left;"> J </td> <td style="text-align:right;"> 678 </td> <td style="text-align:right;"> 5103.513 </td> <td style="text-align:right;"> 336 </td> <td style="text-align:right;"> 18430 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 1603 </td> <td style="text-align:right;"> 3631.293 </td> <td style="text-align:right;"> 367 </td> <td style="text-align:right;"> 18575 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> E </td> <td style="text-align:right;"> 2337 </td> <td style="text-align:right;"> 3538.914 </td> <td style="text-align:right;"> 326 </td> <td style="text-align:right;"> 18477 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 2331 </td> <td style="text-align:right;"> 4324.890 </td> <td style="text-align:right;"> 342 </td> <td style="text-align:right;"> 18791 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> G </td> <td style="text-align:right;"> 2924 </td> <td style="text-align:right;"> 4500.742 </td> <td style="text-align:right;"> 382 </td> <td style="text-align:right;"> 18741 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> H </td> <td style="text-align:right;"> 2360 </td> <td style="text-align:right;"> 5216.707 </td> <td style="text-align:right;"> 368 </td> <td style="text-align:right;"> 18795 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> I </td> <td style="text-align:right;"> 1428 </td> <td style="text-align:right;"> 5946.181 </td> <td style="text-align:right;"> 334 </td> <td style="text-align:right;"> 18823 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> J </td> <td style="text-align:right;"> 808 </td> <td style="text-align:right;"> 6294.592 </td> <td style="text-align:right;"> 363 </td> <td style="text-align:right;"> 18710 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 2834 </td> <td style="text-align:right;"> 2629.095 </td> <td style="text-align:right;"> 367 </td> <td style="text-align:right;"> 18693 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> E </td> <td style="text-align:right;"> 3903 </td> <td style="text-align:right;"> 2597.550 </td> <td style="text-align:right;"> 326 </td> <td style="text-align:right;"> 18729 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 3826 </td> <td style="text-align:right;"> 3374.939 </td> <td style="text-align:right;"> 408 </td> <td style="text-align:right;"> 18780 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> G </td> <td style="text-align:right;"> 4884 </td> <td style="text-align:right;"> 3720.706 </td> <td style="text-align:right;"> 361 </td> <td style="text-align:right;"> 18806 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> H </td> <td style="text-align:right;"> 3115 </td> <td style="text-align:right;"> 3889.335 </td> <td style="text-align:right;"> 357 </td> <td style="text-align:right;"> 18760 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> I </td> <td style="text-align:right;"> 2093 </td> <td style="text-align:right;"> 4451.970 </td> <td style="text-align:right;"> 348 </td> <td style="text-align:right;"> 18779 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> J </td> <td style="text-align:right;"> 896 </td> <td style="text-align:right;"> 4918.186 </td> <td style="text-align:right;"> 340 </td> <td style="text-align:right;"> 18508 </td> </tr> </tbody> </table> --- # Window functions Group by functions can also be applied per row. Say we want to retrieve the top 0.1% of diamonds by price for each cut type. ```r diamonds %>% group_by(cut) %>% filter(price > quantile(price, 0.999)) ``` ``` ## # A tibble: 55 x 10 ## # Groups: cut [5] ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 2.02 Fair H VS2 64.5 57 18565 8 7.95 5.14 ## 2 2.01 Fair G SI1 70.6 64 18574 7.43 6.64 4.69 ## 3 1.49 Ideal F VVS2 61.1 58 18614 7.36 7.38 4.5 ## 4 2.01 Good G SI1 60.3 60 18625 8.07 8.18 4.9 ## 5 2.03 Very Good G SI1 60.1 60 18630 8.09 8.18 4.89 ## 6 2.01 Good H VS2 63.3 55 18640 7.87 7.94 5 ## 7 2.09 Very Good F SI2 63.4 55 18640 8.18 8.11 5.16 ## 8 2.1 Ideal H SI1 62.9 57 18648 8.19 8.12 5.13 ## 9 2.05 Ideal H VS2 61.6 57 18659 8.2 8.14 5.03 ## 10 2.04 Ideal G SI1 59 57 18663 8.32 8.37 4.92 ## # ... with 45 more rows ``` --- # Standard normal of prices by cut ```r diamonds %>% group_by(cut) %>% mutate(price_std = (price - mean(price)) / sd(price)) %>% select(cut, price, price_std) %>% ungroup() %>% sample_n(20) ``` ``` ## # A tibble: 20 x 3 ## cut price price_std ## <ord> <int> <dbl> ## 1 Premium 421 -0.957 ## 2 Ideal 2415 -0.274 ## 3 Good 2058 -0.508 ## 4 Very Good 3465 -0.131 ## 5 Ideal 676 -0.730 ## 6 Premium 4639 0.0126 ## 7 Premium 4412 -0.0396 ## 8 Ideal 1273 -0.574 ## 9 Ideal 765 -0.707 ## 10 Very Good 778 -0.814 ## 11 Ideal 764 -0.707 ## 12 Ideal 2447 -0.265 ## 13 Premium 7274 0.618 ## 14 Very Good 530 -0.877 ## 15 Fair 9996 1.58 ## 16 Very Good 1410 -0.653 ## 17 Ideal 729 -0.716 ## 18 Very Good 3755 -0.0576 ## 19 Very Good 354 -0.922 ## 20 Ideal 530 -0.769 ``` --- # tidyr <img src="https://d33wubrfki0l68.cloudfront.net/5f8c22ec53a1ac61684f3e8d59c623d09227d6b9/b15de/images/hex-tidyr.png" style = "position: fixed; right: 10px; height: 118px"/> * It is also possible to change the shape of data dataframe from a long format to wide or long to wide. * In the tidyverse, the package is called `tidyr` and has two function of note `gather` and `spread`. * `gather` pulls multiple columns together into two columns, one column that records the original column names (keys) and the second the values. If you gather 3 columns, then the resulting data frame will be 3 times longer. * `spread` takes key-value pairs and spreads them out into mulitple columns. * More info at [https://r4ds.had.co.nz/tidy-data.html](https://r4ds.had.co.nz/tidy-data.html) --- # Gather  --- # Spread  --- # tidy examples Convert diamonds dimensions to long format ```r diamonds %>% select(price, x,y,z) %>% gather(dim, measure, -price) %>% head(10) ``` ``` ## # A tibble: 10 x 3 ## price dim measure ## <int> <chr> <dbl> ## 1 326 x 3.95 ## 2 326 x 3.89 ## 3 327 x 4.05 ## 4 334 x 4.2 ## 5 335 x 4.34 ## 6 336 x 3.94 ## 7 336 x 3.95 ## 8 337 x 4.07 ## 9 337 x 3.87 ## 10 338 x 4 ``` --- # Plot ```r diamonds %>% select(price, x,y,z) %>% gather(dim, measure, -price) %>% ggplot(aes(measure, price, colour = dim)) + geom_point() + scale_x_log10() + scale_y_log10() ``` <!-- --> --- # File IO * Big data files (say > 1gb) are slow to read in base R using either (read.csv or read.delim). * Options: * `readr::read_csv`, tidyverse option * `data.table`'s `fread`, parallel file reader targeted @ speed * `vroom::vroom`, lazy file reader, indexes where the data is in the file. Supports fixed width. <img src="https://d33wubrfki0l68.cloudfront.net/25c7688c73021b5e7adcf6d105552b97c27d696c/c77d0/articles/2019-05-vroom-1-0-0_files/figure-html/benchmark_plot-1.png" style="width: 50%; display: block; margin-left: auto; margin-right: auto;" /> --- # Databases (dbplyr) .pull-left[ * `dplyr` can work with databases using the same commands as previous. For example, say that we placed the diamonds data in a sqlite database ```r library(DBI) file.remove("diamonds_db.sqlite3") ``` ``` ## [1] TRUE ``` ```r diamonds_db <- dbConnect(RSQLite::SQLite(), "diamonds_db.sqlite3") dbWriteTable(diamonds_db, "diamonds", diamonds) ``` then we can query this data without retreiving it into memory ] .pull-right[ .small[ ```r diamonds_db %>% tbl("diamonds") %>% group_by(color, cut) %>% summarise(pricemean = mean(price, na.rm=TRUE)) ``` ``` ## # Source: lazy query [?? x 3] ## # Database: sqlite 3.22.0 [C:\UserData\My Files\AGBU\R support ## # group\dplyr\diamonds_db.sqlite3] ## # Groups: color ## color cut pricemean ## <chr> <chr> <dbl> ## 1 D Fair 4291. ## 2 D Good 3405. ## 3 D Ideal 2629. ## 4 D Premium 3631. ## 5 D Very Good 3470. ## 6 E Fair 3682. ## 7 E Good 3424. ## 8 E Ideal 2598. ## 9 E Premium 3539. ## 10 E Very Good 3215. ## # ... with more rows ``` ] ]